Colin Raffel

Shammie

FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Jan 29, 2026, via arXiv

Efficiently Estimating Data Efficiency for Language Model Fine-tuning

Dec 31, 2025, via arXiv

TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior

Dec 23, 2025, via arXiv

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Jun 26, 2025, via arXiv

The Butterfly Effect: Neural Network Training Trajectories Are Highly Sensitive to Initial Conditions

Jun 16, 2025, via arXiv

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Jun 05, 2025, via arXiv

Enhancing Training Data Attribution with Representational Optimization

May 24, 2025, via arXiv

Position: The Most Expensive Part of an LLM should be its Training Data

Apr 16, 2025, via arXiv

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Feb 04, 2025, via arXiv

AttriBoT: A Bag of Tricks for Efficiently Approximating Leave-One-Out Context Attribution

Nov 22, 2024, via arXiv